{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Synthetic Data Generation\n",
    " \n",
    "RAG Applications are often tricky to evaluate, especially when you haven't obtained any user queries to begin. In this notebook, we'll see how we can use `instructor` to quickly generate synthetic questions from a dataset to benchmark your retrieval systems using some simple metrics. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Data Ingestion\n",
    "\n",
    "Let's first start by installing the required packages and ingesting the first 200 rows of the `ms-marco` dataset into our local database. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 91,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[2mAudited \u001b[1m7 packages\u001b[0m in 301ms\u001b[0m\n"
     ]
    }
   ],
   "source": [
    "!uv pip install instructor openai datasets lancedb tantivy tenacity tqdm"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We're using `lancedb` here to easily ingest large amounts of data. This is preferable since we can define our table schema using a `Pydantic` Schema and also have LanceDB automatically handle the generation of the embeddings using their `get_registry()` method that we can define as an object property."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "from lancedb import connect\n",
    "\n",
    "\n",
    "DB_PATH = \"./db\"\n",
    "DB_TABLE = \"ms_marco\"\n",
    "\n",
    "# Create a db at the path `./db`\n",
    "db = connect(DB_PATH)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [],
   "source": [
    "from lancedb.pydantic import LanceModel, Vector\n",
    "from lancedb.embeddings import get_registry\n",
    "\n",
    "\n",
    "func = get_registry().get(\"openai\").create(name=\"text-embedding-3-small\")\n",
    "\n",
    "\n",
    "class Chunk(LanceModel):\n",
    "    passage: str = func.SourceField()\n",
    "    chunk_id: str\n",
    "    embedding: Vector(func.ndims()) = func.VectorField()\n",
    "\n",
    "\n",
    "table = db.create_table(DB_TABLE, schema=Chunk, exist_ok=True, mode=\"overwrite\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [],
   "source": [
    "from datasets import load_dataset\n",
    "\n",
    "N_ROWS = 200\n",
    "\n",
    "dataset = load_dataset(\"ms_marco\", \"v1.1\", split=\"train\", streaming=True).take(N_ROWS)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "dict_keys(['answers', 'passages', 'query', 'query_id', 'query_type', 'wellFormedAnswers'])"
      ]
     },
     "execution_count": 33,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# from itertools import islice\n",
    "first_item = next(iter(dataset))\n",
    "first_item.keys()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[\"Since 2007, the RBA's outstanding reputation has been affected by the 'Securency' or NPA scandal. These RBA subsidiaries were involved in bribing overseas officials so that Australia might win lucrative note-printing contracts. The assets of the bank include the gold and foreign exchange reserves of Australia, which is estimated to have a net worth of A$101 billion. Nearly 94% of the RBA's employees work at its headquarters in Sydney, New South Wales and at the Business Resumption Site.\",\n",
       " \"The Reserve Bank of Australia (RBA) came into being on 14 January 1960 as Australia 's central bank and banknote issuing authority, when the Reserve Bank Act 1959 removed the central banking functions from the Commonwealth Bank. The assets of the bank include the gold and foreign exchange reserves of Australia, which is estimated to have a net worth of A$101 billion. Nearly 94% of the RBA's employees work at its headquarters in Sydney, New South Wales and at the Business Resumption Site.\",\n",
       " 'RBA Recognized with the 2014 Microsoft US Regional Partner of the ... by PR Newswire. Contract Awarded for supply and support the. Securitisations System used for risk management and analysis. ']"
      ]
     },
     "execution_count": 36,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "first_item[\"passages\"][\"passage_text\"][:3]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [],
   "source": [
    "import hashlib\n",
    "from itertools import batched\n",
    "\n",
    "\n",
    "def get_passages(dataset):\n",
    "    for row in dataset:\n",
    "        for passage in row[\"passages\"][\"passage_text\"]:\n",
    "            yield {\n",
    "                \"passage\": passage,\n",
    "                \"chunk_id\": hashlib.md5(passage.encode()).hexdigest(),\n",
    "            }\n",
    "\n",
    "\n",
    "passages = batched(get_passages(dataset), 10)\n",
    "\n",
    "for passage_batch in passages:\n",
    "    # print(passage_batch)\n",
    "    table.add(list(passage_batch))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Synthetic Questions\n",
    "\n",
    "Now that we have the first ~2000 passages from the MS-Marco dataset ingested into our database. Let's start generating some synthetic questions using the chunks we've ingested. \n",
    "\n",
    "Let's see how we might do so using `instructor` by defining a datamodel that can help support this use-case."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [],
   "source": [
    "from pydantic import BaseModel, Field\n",
    "\n",
    "\n",
    "class QuestionAnswerPair(BaseModel):\n",
    "    \"\"\"\n",
    "    This model represents a pair of a question generated from a text chunk, its corresponding answer,\n",
    "    and the chain of thought leading to the answer. The chain of thought provides insight into how the answer\n",
    "    was derived from the question.\n",
    "    \"\"\"\n",
    "\n",
    "    chain_of_thought: str = Field(\n",
    "        description=\"The reasoning process leading to the answer.\"\n",
    "    )\n",
    "    question: str = Field(description=\"The generated question from the text chunk.\")\n",
    "    answer: str = Field(description=\"The answer to the generated question.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Once we've defined this data-model, we can then use it in an instructor call to generate a synthetic question."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{\n",
      "  \"chain_of_thought\": \"To form a specific question from the given text chunk, I should focus on the unique details provided about the Reserve Bank of Australia, such as its creation, functions, and assets.\",\n",
      "  \"question\": \"When was the Reserve Bank of Australia established as Australia's central bank and banknote issuing authority?\",\n",
      "  \"answer\": \"The Reserve Bank of Australia was established as Australia's central bank and banknote issuing authority on 14 January 1960.\"\n",
      "}\n"
     ]
    }
   ],
   "source": [
    "from openai import OpenAI\n",
    "from instructor import from_openai\n",
    "\n",
    "client = from_openai(OpenAI())\n",
    "\n",
    "\n",
    "def generate_question(chunk: str) -> QuestionAnswerPair:\n",
    "    return client.create(\n",
    "        model=\"gpt-4o\",\n",
    "        messages=[\n",
    "            {\n",
    "                \"role\": \"system\",\n",
    "                \"content\": \"You are a world class AI that excels at generating hypothetical search queries. You're about to be given a text snippet and asked to generate a search query which is specific to the specific text chunk that you'll be given. Make sure to use information from the text chunk.\",\n",
    "            },\n",
    "            {\"role\": \"user\", \"content\": f\"Here is the text chunk: {chunk}\"},\n",
    "        ],\n",
    "        response_model=QuestionAnswerPair,\n",
    "    )\n",
    "\n",
    "\n",
    "text_chunk = \"\"\"\n",
    "The Reserve Bank of Australia (RBA) came into being on 14 January 1960 as Australia 's central bank and banknote issuing authority, when the Reserve Bank Act 1959 removed the central banking functions from the Commonwealth Bank. The assets of the bank include the gold and foreign exchange reserves of Australia, which is estimated to have a net worth of A$101 billion. Nearly 94% of the RBA's employees work at its headquarters in Sydney, New South Wales and at the Business Resumption Site.\n",
    "\"\"\"\n",
    "print(generate_question(text_chunk).model_dump_json(indent=2))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now that we've seen how to generate a single question, let's see how we might be able to scale this up. We can do so by taking advantage of the `asyncio` library and `tenacity` to handle retries."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 56,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[\"Since 2007, the RBA's outstanding reputation has been affected by the 'Securency' or NPA scandal. These RBA subsidiaries were involved in bribing overseas officials so that Australia might win lucrative note-printing contracts. The assets of the bank include the gold and foreign exchange reserves of Australia, which is estimated to have a net worth of A$101 billion. Nearly 94% of the RBA's employees work at its headquarters in Sydney, New South Wales and at the Business Resumption Site.\",\n",
       " \"The Reserve Bank of Australia (RBA) came into being on 14 January 1960 as Australia 's central bank and banknote issuing authority, when the Reserve Bank Act 1959 removed the central banking functions from the Commonwealth Bank. The assets of the bank include the gold and foreign exchange reserves of Australia, which is estimated to have a net worth of A$101 billion. Nearly 94% of the RBA's employees work at its headquarters in Sydney, New South Wales and at the Business Resumption Site.\"]"
      ]
     },
     "execution_count": 56,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "chunks = table.to_pandas()\n",
    "chunks = [item for item in chunks[\"passage\"]]\n",
    "chunks[:2]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 98,
   "metadata": {},
   "outputs": [],
   "source": [
    "from asyncio import Semaphore\n",
    "from tenacity import retry, stop_after_attempt, wait_exponential\n",
    "from openai import AsyncOpenAI\n",
    "import asyncio\n",
    "\n",
    "client = from_openai(AsyncOpenAI())\n",
    "\n",
    "\n",
    "async def generate_questions(chunks: list[str], max_queries: int):\n",
    "    @retry(\n",
    "        stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=4, max=10)\n",
    "    )\n",
    "    async def generate_question(\n",
    "        chunk: str, sem: Semaphore\n",
    "    ) -> tuple[QuestionAnswerPair, str]:\n",
    "        async with sem:\n",
    "            return (\n",
    "                await client.create(\n",
    "                    model=\"gpt-3.5-turbo\",\n",
    "                    messages=[\n",
    "                        {\n",
    "                            \"role\": \"system\",\n",
    "                            \"content\": \"You are a world class AI that excels at generating hypothetical search queries. You're about to be given a text snippet and asked to generate a search query which is specific to the specific text chunk that you'll be given. Make sure to use information from the text chunk.\",\n",
    "                        },\n",
    "                        {\"role\": \"user\", \"content\": f\"Here is the text chunk: {chunk}\"},\n",
    "                    ],\n",
    "                    response_model=QuestionAnswerPair,\n",
    "                ),\n",
    "                chunk,\n",
    "            )\n",
    "\n",
    "    sem = Semaphore(max_queries)\n",
    "    coros = [generate_question(chunk, sem) for chunk in chunks]\n",
    "    return await asyncio.gather(*coros)\n",
    "\n",
    "\n",
    "questions = await generate_questions(chunks[:300], 10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Benchmarking Retrieval\n",
    "\n",
    "Now that we've generated a list of questions to query our database with, let's do a quick benchmark to see how full text search compares against that of hybrid search. We'll use two simple metrics here - Mean Reciprocal Rank ( MRR ) and Recall.\n",
    "\n",
    "Let's start by making sure we have an inverted index created on our table above that we can perform full text search on"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 64,
   "metadata": {},
   "outputs": [],
   "source": [
    "table.create_fts_index(\"passage\", replace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This allows us to then use the `.search` function on each table to query it using full text search. Let's see an example below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 67,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "A rebuildable atomizer (RBA), often referred to as simply a “rebuildable,” is just a special type of atomizer used in the Vape Pen and Mod Industry that connects to a personal vaporizer. 1 The bottom feed RBA is, perhaps, the easiest of all RBA types to build, maintain, and use. 2  It is filled from below, much like bottom coil clearomizer. 3  Bottom feed RBAs can utilize cotton instead of silica for the wick. 4  The Genesis, or genny, is a top feed RBA that utilizes a short woven mesh wire.\n",
      "Results-Based Accountability® (also known as RBA) is a disciplined way of thinking and taking action that communities can use to improve the lives of children, youth, families, adults and the community as a whole. RBA is also used by organizations to improve the performance of their programs. RBA improves the lives of children, families, and communities and the performance of programs because RBA: 1  Gets from talk to action quickly; 2  Is a simple, common sense process that everyone can understand; 3  Helps groups to surface and challenge assumptions that can be barriers to innovation;\n"
     ]
    }
   ],
   "source": [
    "for entry in table.search(\"RBA\", query_type=\"fts\").limit(2).to_list():\n",
    "    print(entry[\"passage\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Metrics\n",
    "\n",
    "Now that we've figured out how we might be able to query our table using full text search. Let's take a step back and see how we can implement some metrics to quantiatively evaluate the retrieved items. It's important to note that when we want to evaluate the quality of our listings, we always take it at some subset of k.\n",
    "\n",
    "This is important because k is often constrained by a business outcome and can help us determine how well our solution works\n",
    "\n",
    "Eg. Here are some hypothetical scenarios\n",
    "\n",
    "- k=5 : We'd like to display some recommended items based of a user query (Eg. Help me plan out a dinner with Jonathan next week -> Display 5 possible actions)\n",
    "- k=10 : We have a small carousel with recommended items for a user to buy\n",
    "- k=25 : We're using a re-ranker, is it filtering out the irrelevant chunks from the relevant chunks well?\n",
    "- k=50 : We have a pipeline that fetches information for a model to respond with, are we fetching all relevant bits of information\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Reciprocal Rank\n",
    "\n",
    "Reciprocal Rank\n",
    "Imagine we're spotify and we want to suggest a couple of songs to the user. Which is a better result among the two lists of retrieved songs below? ( Note that 2 is the answer we want )\n",
    "\n",
    "- [0,1,2,3,4]\n",
    "- [0,1,3,4,2]\n",
    "\n",
    "Obviously if we're suggesting songs to the user, we want the first relevant song to be listed as early as possible! Therefore we'd prefer 1 over 2 in the example above because 2 is ordered earlier in the first case. A metric that works well for this is the Reciprocal Rank (RR).\n",
    "\n",
    "![](../img/mrr_eqn.png)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 84,
   "metadata": {},
   "outputs": [],
   "source": [
    "def rr(results, labels):\n",
    "    return max(\n",
    "        [\n",
    "            round(1 / (results.index(label) + 1), 2) if label in results else 0\n",
    "            for label in labels\n",
    "        ]\n",
    "    )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This is an aggressive metric and once we get to an position of > 10, the value doesn't change much anymore. Most of the big changes happen at indexes < 10."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Recall\n",
    "\n",
    "Another metric that we can track is recall which measures how many of our retrieved items were retrieved. \n",
    "\n",
    "![](../img/recall_eqn.png)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 69,
   "metadata": {},
   "outputs": [],
   "source": [
    "def recall(results, relevant_chunks):\n",
    "    return sum([1 if chunk in results else 0 for chunk in relevant_chunks]) / len(\n",
    "        relevant_chunks\n",
    "    )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Using Our Questions\n",
    "\n",
    "Now that we've seen two metrics that we can use and how we might be able to generate some synthetic questions, let's try it out on an actual question.\n",
    "\n",
    "To do so, we'll first generate a unique chunk id for our original passage that we generated the question from. \n",
    "\n",
    "We'll then compare the chunk_ids of the retrieved chunks and then compute the `mrr` and the `recall` of the retrieved results."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 86,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "('b6d9bf888fd53590ee69a913bd9bf8a4',\n",
       " \"What factors influence the average salary for people with a bachelor's degree?\",\n",
       " \"However, the average salary for people with a bachelor's degree varies widely based upon several factors, including their major, job position, location and years of experience. The National Association of Colleges and Employers conducted a salary survey that determined the average starting salary for graduates of various bachelor's degree programs.\")"
      ]
     },
     "execution_count": 86,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import hashlib\n",
    "\n",
    "sample_question, chunk = questions[0]\n",
    "\n",
    "chunk_id = hashlib.md5(chunk.encode()).hexdigest()\n",
    "chunk_id, sample_question.question, chunk"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 81,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['b6d9bf888fd53590ee69a913bd9bf8a4',\n",
       " '7a0254c9dc709220367857dcb67f2c8d',\n",
       " '04e7e6f91463033aa87b4104ea16b477']"
      ]
     },
     "execution_count": 81,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "retrieved_results = (\n",
    "    table.search(sample_question.question, query_type=\"fts\").limit(25).to_list()\n",
    ")\n",
    "retrieved_chunk_ids = [item[\"chunk_id\"] for item in retrieved_results]\n",
    "\n",
    "retrieved_chunk_ids[:3]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can now compute the results for the retrieved items that we've obtained using full text search relative to the ground truth label that we have - the original chunk that we generated it from"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 85,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(1.0, 1.0)"
      ]
     },
     "execution_count": 85,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "recall(retrieved_chunk_ids, [chunk_id]), rr(retrieved_chunk_ids, [chunk_id])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Scaling it up for different values of `k`, where we can see how this value changes for different subsets of the retrieved items is relatively simple. \n",
    "\n",
    "We can generate this mapping automatically using `itertools.product`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 112,
   "metadata": {},
   "outputs": [],
   "source": [
    "from itertools import product\n",
    "\n",
    "SIZES = [3, 5, 10, 15, 25]\n",
    "METRICS = [[\"mrr\", rr], [\"recall\", recall]]\n",
    "\n",
    "score_fns = {}\n",
    "\n",
    "for metric, size in product(METRICS, SIZES):\n",
    "    metric_name, score_fn = metric\n",
    "    score_fns[f\"{metric_name}@{size}\"] = (\n",
    "        lambda predictions, labels, fn=score_fn, k=size: fn(predictions[:k], labels)\n",
    "    )  # type: ignore"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Running an Evaluation"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can now use the code above to run a test to see how our full text search performs for our synthetic questions. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 114,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|██████████| 300/300 [00:07<00:00, 41.64it/s]\n"
     ]
    }
   ],
   "source": [
    "import hashlib\n",
    "from tqdm import tqdm\n",
    "\n",
    "fts_results = []\n",
    "\n",
    "for sample_qn, chunk in tqdm(questions):\n",
    "    chunk_id = hashlib.md5(chunk.encode()).hexdigest()\n",
    "    cleaned_question = \"\".join(\n",
    "        char for char in sample_qn.question if char.isalnum() or char.isspace()\n",
    "    )\n",
    "    retrieved_results = (\n",
    "        table.search(cleaned_question, query_type=\"fts\").limit(25).to_list()\n",
    "    )\n",
    "    retrieved_chunk_ids = [item[\"chunk_id\"] for item in retrieved_results]\n",
    "\n",
    "    fts_results.append(\n",
    "        {\n",
    "            metric: score_fn(retrieved_chunk_ids, [chunk_id])\n",
    "            for metric, score_fn in score_fns.items()\n",
    "        }\n",
    "    )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 115,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "mrr@3        0.784267\n",
       "mrr@5        0.791267\n",
       "mrr@10       0.797633\n",
       "mrr@15       0.798133\n",
       "mrr@25       0.798433\n",
       "recall@3     0.896667\n",
       "recall@5     0.926667\n",
       "recall@10    0.973333\n",
       "recall@15    0.980000\n",
       "recall@25    0.986667\n",
       "dtype: float64"
      ]
     },
     "execution_count": 115,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import pandas as pd\n",
    "\n",
    "df = pd.DataFrame(fts_results)\n",
    "df.mean()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can see that on average full text search is able to surface the relevant item 97-98% of the time if we take `k=10` and that we have the relevant item in between the first and second item here.\n",
    "\n",
    "Now, because these are synthetic question, there's likely to be a large amount of overlap in the phrases used in the questions and the original source text, leading to the high values.\n",
    "\n",
    "In actual production applications and your domain specific dataset, it's useful to do these experiments and see what works best for your needs."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "venv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
